Session 6: Introduction to the Web

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-07-29

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, we learn how to scout data in the wild. We will:

  • discuss web scraping from a theoretical point of view:
    • What is web scraping?
    • Why should you learn it?
    • What legal and ethical implications should you keep in mind?
  • learn a bit more about how the Internet works
    • What is HTML
    • What is CSS

Angie Gade via unsplash.com

What is Web Scraping

Forms to get data from the web

  • Download data from a website
  • Retrieve data via an API
  • Scrape the (unstructured) Data

Image Source: daveberesford.co.uk

Web Scraping

  • Used when other means are unavailable
  • Scrape the (unstructured) Data from a website
    • E.g.: get the author, title, date, and body of an online news article
    • E.g.: get a table from a website (like Wikipedia)
    • E.g.: get all entries from a blog
    • E.g.: get all press statements from a political party website
    • E.g.: get all hyperlinks to files on a website
  • A web-scraper is a program (or robot) that:
    • goes to a web page
    • downloads its content
    • extracts data from the content
    • then saves the data to a file or a database

Web Scraping: A Three-Step Process

  1. Send an HTTP request to the webpage -> server responds to the request by returning HTML content
  2. Parse the HTML content -> extract the information you want from the nested structure of HTML code
  3. Wrangle the data into a useful format
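In rvest, the three steps map onto a handful of functions. A minimal sketch (the URL and selector here are placeholders, not a real scraping target):

```r
library(rvest)

# 1. Send an HTTP request: read_html() downloads and parses the response
page <- read_html("https://example.com")

# 2. Parse: extract the information you want from the nested HTML
headline <- page |>
  html_element("h1") |>
  html_text2()

# 3. Wrangle: store the result in a rectangular format
data.frame(url = "https://example.com", headline = headline)
```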

Original Image Source: prowebscraper.com

Hurdles

  • Unfortunately no one-size-fits-all solution
    • Lots of different techniques, tools, tricks
    • Websites change (some more frequently than others)
  • Some web pages are easier to scrape than others (by accident or on purpose!):
  1. Well behaved static HTML with recurring patterns
  2. Haphazard HTML not clearly differentiating between different types of information
  3. Interactive web sites loading content by executing code (usually JavaScript)
  4. Interactive web sites with mechanisms against data extraction (rate limits, captchas etc.)
  5. Collecting raw data might be limited, data wrangling might be made difficult

Why Should You Learn Web Scraping?

  • The internet is a data gold mine!
  • The data were not created for research, but they are often traces of what people are actually doing on the internet
  • Reproducible and renewable data collection (e.g., rehydrate data that is copyrighted)
  • Web scraping lets you automate data retrieval (as opposed to tedious copy & paste on some website)
  • It’s one of the most fun ways to learn R and programming!
    • It’s engaging and satisfying to find repeating patterns that you can employ to structure data (every website becomes a little puzzle)
    • It touches on many important computational skills
    • The return is good data to further your career (unlike sudokus or video games)

Found vs Designed Data

Designed Data

  • Collected for research
  • Full control of shape and form
  • Problems of validity due to social desirability and imperfect measurement

Found Data

  • Traces of human behavior
  • Comes in all shapes and forms
  • Problems of validity due to unrepresentative samples and incomplete access

Implications of Web Scraping

ToS and Robots.txt

Twitter ToS

User-agent: *                         # the rules apply to all user agents
Disallow: /EPiServer/CMS/             # do not crawl any URLs that start with /EPiServer/CMS/
Disallow: /Util/                      # do not crawl any URLs that start with /Util/ 
Disallow: /about/art-in-parliament/   # do not crawl any URLs that start with /about/art-in-parliament/

https://www.parliament.uk/robots.txt
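You can check a site’s robots.txt rules programmatically before scraping. One way is the robotstxt package (an assumption here: that the package is installed; the call needs internet access):

```r
# Check whether a path may be crawled, using the robotstxt package.
# paths_allowed() fetches the site's robots.txt and evaluates its rules.
library(robotstxt)

paths_allowed(
  paths  = "/Util/",          # a path disallowed in the example above
  domain = "www.parliament.uk"
)
```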

Ethical

  • Are there other means available to get to the data (e.g., via an API)?
  • robots.txt might not be legally binding, but it is not nice to ignore it
  • Scraping can put a heavy load on a website (if you make 1000s of requests), which costs the host money and might even bring down the site (like a DDoS attack)
  • Think twice before scraping personal data. You should ask yourself:
    • is it necessary for your research?
    • are you harming anyone by obtaining (or distributing) the data?
    • do you really need everything, or are parts of the data sufficient (e.g., can you preselect cases or ignore variables)?

Advice?

Legal and ethical advice is rare and complicated to give. A good opinion piece on the topic is Freelon (2018). It is worth reading in full, but it can be summarised in three general pieces of advice:

  • Use authorized methods whenever possible
  • Do not confuse terms of service compliance with data protection
  • Understand the risks of violating terms of service

Exercises 1

Twitter recently made access to its API punishingly expensive and stopped free academic access for research. If you wanted to do research on Twitter data through web scraping anyway, what implications would that have:

  1. Legally

  2. Ethically

  3. Practically

What are HTML and CSS

What is HTML

  • HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser
  • Contains the raw data (text, URLs to pictures and videos) plus defines the layout and some of the styling of text
  • An HTML element consists of a start tag, content, and an end tag. The start tag can carry attributes as name/value pairs:

    <p class="paragraph">This is a paragraph.</p>

    Here <p ...> is the start tag, class is an attribute name, "paragraph" is the attribute value, This is a paragraph. is the content, and </p> is the end tag.

Image Source: Wikipedia.org

Example: Simple

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With headline and author

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author"><a href="https://www.johannesbgruber.eu/">Me</a></p>
    <p>This is the body of the text.</p>
</body>
</html>

Browser View:

Example: With some data

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this data:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>John</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Mary</td>
            <td>26</td>
        </tr>
    </table>
</body>
</html>

Browser View:

Example: With an image

Code:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
</head>
<body>
    <h1>My Headline</h1>
    <p class="author">Me</p>
    <p>This is the body of the text.</p>
    <p>Consider this image:</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
</body>
</html>

Browser View:

What is CSS

  • CSS (Cascading Style Sheets) is very often used in addition to HTML to control the presentation of a document
  • Designed to enable the separation of content from things concerning the look, such as layout, colours, and fonts.
  • The reason CSS is interesting for web scraping is that the same kind of information often gets the same styling

Example: CSS

HTML:

<!DOCTYPE html>
<html>
<head>
    <title>My Simple HTML Page</title>
    <link rel="stylesheet" type="text/css" href="example.css">
</head>
<body>
  <h1 class="headline">My Headline</h1>
  <p class="author">Me</p>
  <div class="content">
    <p>This is the body of the text.</p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/About_The_Dog.jpg/640px-About_The_Dog.jpg" alt="About The Dog.">
    <p>Consider this data:</p>
    <table>
      <tr class="top-row">
          <th>Name</th>
          <th>Age</th>
      </tr>
      <tr>
          <td>John</td>
          <td>25</td>
      </tr>
      <tr>
          <td>Mary</td>
          <td>26</td>
      </tr>
    </table>
  </div>
</body>
</html>

CSS:

/* CSS file */

.headline {
  color: red;
}

.author {
  color: grey;
  font-style: italic;
  font-weight: bold;
}

.top-row {
  background-color: lightgrey;
}

.content img {
  border: 2px solid black;
}

table, th, td {
  border: 1px solid black;
}

Browser View:

Exercises 2

  1. Add another image and another paragraph to data/example.html and display it in your browser
  2. Add a second level headline to the page
  3. Add another image to the page
  4. Manipulate the files data/example.html and/or data/example.css so that “content” is displayed in italics

HTML and CSS in Web Scraping: a preview

Using HTML tags:

You can select HTML elements by their tags

library(rvest)
read_html("data/example.html") |> 
  html_elements("p") |> 
  html_text2()
[1] "Me"                            "This is the body of the text."
[3] "Consider this image:"          "Consider this data:"          
  • to select them, tags are written without the <>
  • in theory, arbitrary tags are possible, but commonly people use <p> (paragraph), <br> (line break), <h1>, <h2>, <h3>, … (first, second, third, … level headline), <b> (bold), <i> (italic), <img> (image), <a> (hyperlink), and a couple more.

Using attributes

You can select elements by an attribute, including the class:

read_html("data/example.html") |> 
  html_element("[class=\"headline\"]") |> 
  html_text()
[1] "My Headline"

For class, there is also a shorthand:

read_html("data/example.html") |> 
  html_element(".headline") |> 
  html_text()
[1] "My Headline"

Another important shorthand is #, which selects the id attribute:

read_html("data/example.html") |> 
  html_element("#table-1") |> 
  html_table()
# A tibble: 2 × 2
  Name    Age
  <chr> <int>
1 John     25
2 Mary     26
read_html("data/example.html") |> 
  html_element("#table-1 > tr")
{html_node}
<tr class="top-row">
[1] <th>Name</th>
[2] <th>Age</th>

Extracting attributes

Instead of selecting by attribute, you can also extract one or all attributes:

read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"   "https://en.wikipedia.org/wiki/Dog"
read_html("data/example.html") |> 
  html_elements("a") |> 
  html_attrs()
[[1]]
                             href 
"https://www.johannesbgruber.eu/" 

[[2]]
                               href 
"https://en.wikipedia.org/wiki/Dog" 

Chaining selectors

If more than one element fits your selector but you only want one of them, see if you can make your selection more specific by chaining selectors with > (for immediate children) or a space (for any descendants of an element):

read_html("data/example.html") |> 
  html_elements(".author>a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"
read_html("data/example.html") |> 
  html_elements(".author a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Tip: there is also no rule against doing this instead:

read_html("data/example.html") |> 
  html_elements(".author") |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://www.johannesbgruber.eu/"

Common Selectors

There are quite a lot of CSS selectors, but often you can stick to just a few:

selector example Selects
element/tag table all <table> elements
class .someTable all elements with class="someTable"
id #table-1 unique element with id="table-1"
element.class tr.headerRow all <tr> elements with the headerRow class
class1.class2 .someTable.blue all elements with the someTable AND blue class
element,element tr,p all <tr> elements AND all <p> elements
class1 tag .table-1 tr all <tr> elements that are descendants of .table-1
class1 > tag .table-1 > tr all <tr> elements with .table-1 as their parent
class1 + tag .table-1 + a the <a> element directly following an element with class .table-1

Family Relations

Each HTML element can contain other elements. To keep track of these relations, we speak of ancestors, descendants, parents, children and siblings.

<book>
  <chapter>
    <section>
      <subsection>
        This is a subsection.
      </subsection>
      <subsection>
        This is another subsection.
      </subsection>
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    <section>
      This is a section.
    </section>
    <section>
      This is a section.
    </section>
  </chapter>
  <chapter>
    This is a chapter without sections.
  </chapter>
</book>
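These relations can be targeted directly with CSS selectors. A short sketch on an abbreviated version of the book example above (the HTML string is shortened here for readability):

```r
library(rvest)

book <- read_html("
<book>
  <chapter>
    <section>
      <subsection>This is a subsection.</subsection>
    </section>
    <section>This is a section.</section>
  </chapter>
</book>")

# descendant (space): <subsection> elements anywhere inside a <chapter>
book |> html_elements("chapter subsection")   # matches the subsection

# child (>): only <subsection> elements whose direct parent is a <chapter>
book |> html_elements("chapter > subsection") # matches nothing: the parent is <section>
```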

Exercises 3

  1. Practice finding the right selector with the CSS Diner game: https://flukeout.github.io/
  2. Consider the toy HTML example below. Which selectors do you need to put into html_elements() (which extracts all elements matching the selector) to extract the information described in the comments?
library(rvest)
webpage <- "<html>
<body>
  <h1>Computational Research in the Post-API Age</h1>
  <div class='author'>Deen Freelon</div>
  <div>Keywords:
    <ul>
      <li>API</li>
      <li>computational</li>
      <li>Facebook</li>
    </ul>
  </div>
  <div class='text'>
    <p>Three pieces of advice on whether and how to scrape from Dan Freelon</p>
  </div>
  
  <ol class='advice'>
    <li id='one'> use authorized methods whenever possible </li>
    <li id='two'> do not confuse terms of service compliance with data protection </li>
    <li id='three'> understand the risks of violating terms of service </li>
  </ol>

</body>
</html>" |> 
  read_html()
# the headline
headline <- html_elements(webpage, "")
headline
# the author
author <- html_elements(webpage, "")
author
# the ordered list
ordered_list <- html_elements(webpage, "")
ordered_list
# all bullet points
bullet_points <- html_elements(webpage, "")
bullet_points
# bullet points in unordered list
bullet_points_unordered <- html_elements(webpage, "")
bullet_points_unordered
# elements in ordered list
bullet_points_ordered <- html_elements(webpage, "")
bullet_points_ordered
# third bullet point in ordered list
bullet_point_three_ordered <- html_elements(webpage, "")
bullet_point_three_ordered

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_1.0.4

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.3         knitr_1.46       
 [5] rlang_1.1.4       xfun_0.44         stringi_1.8.4     processx_3.8.4   
 [9] promises_1.3.0    jsonlite_1.8.8    glue_1.7.0        selectr_0.4-2    
[13] htmltools_0.5.8.1 ps_1.7.7          fansi_1.0.6       chromote_0.2.0   
[17] rmarkdown_2.26    tibble_3.2.1      evaluate_0.23     fastmap_1.1.1    
[21] yaml_2.3.8        lifecycle_1.0.4   stringr_1.5.1     compiler_4.4.1   
[25] codetools_0.2-20  pkgconfig_2.0.3   Rcpp_1.0.12       websocket_1.4.1  
[29] rstudioapi_0.16.0 later_1.3.2       digest_0.6.35     R6_2.5.1         
[33] utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3    tools_4.4.1      
[37] xml2_1.3.6       

References

Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
Luscombe, Alex, Kevin Dick, and Kevin Walby. 2022. “Algorithmic Thinking in the Public Interest: Navigating Technical, Legal, and Ethical Hurdles to Web Scraping in the Social Sciences.” Quality & Quantity 56 (3): 1023–44. https://doi.org/10.1007/s11135-021-01164-0.